Selecting which texts belong in the corpus you plan to analyze (and which ones don't!) is a major interpretive problem. This process of selection is closely tied to the definition of our research question. At the same time, when we seek out patterns across texts, our scholarly arguments are strongest when they hold true across reasonable variations in selection criteria.
For example, say that we perform a distant reading of Shakespeare's Comedies. We analyze them computationally and present our findings to a scholarly audience. However, during Q&A, an objection is raised that the late romances like The Tempest and the problem plays like Measure for Measure, whose status as comedy is subject to debate, are having a disproportionate impact on the patterns we have discovered. Clearly our claim about the Comedies as a whole has been invalidated!
We can anticipate this kind of objection by testing variations in our selection criteria. If a pattern holds true across variations that reflect scholarly debates about the categories themselves, then it offers a strong argumentative foothold. Alternatively, if the linguistic pattern changes with variant corpora, then it offers a wider view of the discursive field.
In this set of exercises, we will identify words that are distinctive of Shakespeare's Comedies, as opposed to the Tragedies and Histories. The corpus for this task is the set of Shakespeare's plays, stripped of all character names and stage directions. Only dialogue remains. These have been made available by Michael Witmore from the Folger Digital Texts collection.
First, we will perform our distinctive word test using the three genres as they are assigned to the plays in the First Folio.
COMEDY
HISTORY
TRAGEDY
Second, we will repeat the process using slightly different categories. In addition to COMEDY, HISTORY, and TRAGEDY, we will include ROMANCE and PROBLEM. Several plays will be shifted into these latter, contested categories.
ROMANCE
PROBLEM
Note: Pericles and Two Noble Kinsmen were not included in the First Folio, but both have been argued to be romances. How will you handle this in your labeling?
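One way to keep the two labeling schemes side by side is to store the First Folio genres once and apply the contested reassignments as overrides. The sketch below is only an illustration under several assumptions: the dictionaries cover a small subset of plays, the keys are play titles rather than the actual corpus filenames, and the names FOLIO_GENRE, VARIANT_OVERRIDES, and label_plays are hypothetical. Which plays count as ROMANCE or PROBLEM is itself debated, so the overrides shown are one possible choice.

```python
# Illustrative subset only; keys are play titles, NOT the corpus filenames.
FOLIO_GENRE = {
    "The Tempest": "COMEDY",
    "Measure for Measure": "COMEDY",
    "Twelfth Night": "COMEDY",
    "Henry V": "HISTORY",
    "Hamlet": "TRAGEDY",
    "Cymbeline": "TRAGEDY",
}

# One possible set of reassignments; membership in these categories is contested.
VARIANT_OVERRIDES = {
    "The Tempest": "ROMANCE",
    "Cymbeline": "ROMANCE",
    "Measure for Measure": "PROBLEM",
}

def label_plays(titles, base, overrides=None, default=None):
    """Return {title: genre}, applying overrides on top of the base labels.

    Titles found in neither mapping receive `default`, which leaves the
    decision about plays outside the First Folio explicit.
    """
    overrides = overrides or {}
    return {t: overrides.get(t, base.get(t, default)) for t in titles}

titles = list(FOLIO_GENRE) + ["Pericles"]
folio_labels = label_plays(titles, FOLIO_GENRE)
variant_labels = label_plays(titles, FOLIO_GENRE, VARIANT_OVERRIDES)
```

Because Pericles appears in neither dictionary, it is labeled None under both schemes until you decide how to treat it, for example by excluding it from the first test and adding it to the overrides as ROMANCE in the second.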
In [ ]:
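import os
# Directory containing the stripped play texts
corpus_dir = "corpora/FDT Shakespeare Stripped/"
# Get a sorted list of filenames for the corpus, so the order is reproducible
filenames = sorted(os.listdir(corpus_dir))
# Read each file as text (assuming UTF-8 encoding), closing it afterwards
texts = []
for filename in filenames:
    with open(os.path.join(corpus_dir, filename), encoding="utf-8") as f:
        texts.append(f.read())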
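The exercise does not prescribe a particular distinctive word test, so the following is only a minimal sketch of one simple option: a smoothed ratio of a word's relative frequency in the target genre to its relative frequency in all other texts. The function names tokenize and distinctive_words, the regular-expression tokenizer, and the toy texts are assumptions made for illustration; other measures (such as log-likelihood) may suit your argument better.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def distinctive_words(texts, labels, target, top_n=10):
    """Rank words by the add-one-smoothed ratio of their relative frequency
    in texts labeled `target` to their relative frequency in all other texts."""
    target_counts, other_counts = Counter(), Counter()
    for text, label in zip(texts, labels):
        counts = target_counts if label == target else other_counts
        counts.update(tokenize(text))
    vocab = set(target_counts) | set(other_counts)
    # Add-one smoothing avoids division by zero for words absent from one side
    target_total = sum(target_counts.values()) + len(vocab)
    other_total = sum(other_counts.values()) + len(vocab)
    scores = {
        w: ((target_counts[w] + 1) / target_total)
           / ((other_counts[w] + 1) / other_total)
        for w in vocab
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data standing in for the plays and their genre labels
toy_texts = ["love love jest marry", "crown war blood crown", "love marry fool"]
toy_labels = ["COMEDY", "HISTORY", "COMEDY"]
top_comedy_words = distinctive_words(toy_texts, toy_labels, "COMEDY", top_n=2)
```

Running the same function once with the First Folio labels and once with the variant labels, then comparing the two ranked lists, is one way to see whether a pattern survives the change in selection criteria.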
In [ ]:
In [ ]:
In [ ]:
In [ ]: